Abstract

We show how a deep denoising autoencoder with lateral connections can be used 
as an auxiliary unsupervised learning task to support supervised learning. The 
proposed model is trained to minimize simultaneously the sum of supervised and 
unsupervised cost functions by back-propagation, avoiding the need for layer-wise
pretraining. It improves the state of the art significantly in the permutation-invariant
MNIST classification task.


1 Introduction 


Adding an auxiliary task to help train a neural network was proposed by Suddarth and Kergosien
(1990). By sharing the hidden representations among more than one task, the network generalizes
better. Hinton and Salakhutdinov (2006) proposed that this auxiliary task could be unsupervised
modelling of the inputs. Ranzato and Szummer (2008) used autoencoder reconstruction as an auxiliary
task for classification but performed the training layer-wise.


Sietsma and Dow (1991) proposed to corrupt network inputs with noise as a regularization method.
Denoising autoencoders (Vincent et al., 2010) use the same principle to create unsupervised models
for data. Rasmus et al. (2015) showed that modulated lateral connections in a denoising autoencoder
change its properties in a fundamental way, making it more suitable as an auxiliary task for
supervised training:


• Lateral connections allow the detailed information to flow directly to the decoder, relieving
the pressure on higher layers to represent all information and allowing them to concentrate
on more abstract features. In contrast to a deep denoising autoencoder, the encoder can discard
information on the way up, similarly to how typical supervised learning tasks discard irrelevant
information.

• With lateral connections, the optimal model shape is pyramid-like, i.e. the dimensionality
of the top layers is lower than that of the bottom layers, which is also true for typical
supervised learning tasks, as opposed to traditional denoising autoencoders, which prefer layers
that are equal in size.


This paper builds on the previous work and shows that using a denoising autoencoder with lateral
connections as an auxiliary task for supervised learning improves the network's generalization
capability, as hypothesized by Valpola (2015). The proposed method achieves state-of-the-art results
in the permutation-invariant MNIST classification task.


2 Proposed Model 

The encoder of the autoencoder acts as the multilayer perceptron network for the supervised task
so that the prediction is made in the highest layer of the encoder, as depicted in Figure 1. For the
decoder, we follow the model by Rasmus et al. (2015) but with a more expressive decoder function
and other minor modifications described in Section 2.2.

Figure 1: The conceptual illustration of the model when $L = 3$. Encoder path from $\tilde{x} \to y$ is a
multilayer perceptron network, bold arrows indicating fully connected weights $W^{(l)}$ upwards
and $V^{(l)}$ downwards and thin arrows neuron-wise connections. $z^{(l)}$ are normalized
preactivations, $\hat{z}^{(l)}$ their denoised versions, and $\hat{x}$ the denoised reconstruction of the
input. $u^{(l)}$ are projections of $\hat{z}^{(l+1)}$ in the dimensions of $z^{(l)}$. $h^{(l)}$ are the
activations and $y$ the class prediction.


2.1 Encoder and Classifier 


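We follow Ioffe and Szegedy (2015) to apply batch normalization to each preactivation, including
the topmost layer in the L-layer network, to ensure fast convergence due to reduced covariate shift.
Formally, when input $h^{(0)} = \tilde{x}$ and $l = 1 \ldots L$,

$$z^{(l)} = N_B(W^{(l)} h^{(l-1)})$$

$$h_i^{(l)} = \phi(\gamma_i^{(l)} (z_i^{(l)} + \beta_i^{(l)}))$$

where $N_B$ is a component-wise batch normalization $N_B(x_i) = (x_i - \hat{\mu}_{x_i}) / \hat{\sigma}_{x_i}$,
where $\hat{\mu}_{x_i}$ and $\hat{\sigma}_{x_i}$ are estimates calculated from the minibatch, $\gamma_i^{(l)}$ and
$\beta_i^{(l)}$ are trainable parameters, and $\phi(\cdot) = \max(0, \cdot)$ is the rectification
nonlinearity, which is replaced by the softmax for the output $y = h^{(L)}$.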


As batch normalization is reported to reduce the need for dropout-style regularization, we only add
isotropic Gaussian noise $n$ to the inputs, $\tilde{x} = h^{(0)} = x + n$.
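
The encoder pass described above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the per-unit form phi(gamma * (z + beta)), the small constant eps in the normalization, the (inputs x outputs) weight layout, and all function and variable names are choices made here for illustration, not details confirmed by the text.

```python
import numpy as np

def batch_norm(x, eps=1e-8):
    # Component-wise normalization with the minibatch mean and standard deviation.
    # eps is an assumption added only for numerical safety.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def encoder_forward(x, weights, gammas, betas, noise_std, rng):
    """Return class probabilities y and the normalized preactivations of each layer."""
    h = x + noise_std * rng.standard_normal(x.shape)  # corrupted input
    zs = []
    for layer, (W, gamma, beta) in enumerate(zip(weights, gammas, betas), start=1):
        z = batch_norm(h @ W)        # normalized preactivation, W stored as (in, out)
        zs.append(z)
        a = gamma * (z + beta)       # per-unit scale and shift (assumed form)
        # Rectifier on hidden layers, softmax on the topmost layer.
        h = softmax(a) if layer == len(weights) else np.maximum(0.0, a)
    return h, zs

# Toy usage with hypothetical layer sizes 8-6-4-3 and a minibatch of 5 examples.
rng = np.random.default_rng(0)
sizes = [8, 6, 4, 3]
weights = [0.1 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
gammas = [np.ones(n) for n in sizes[1:]]
betas = [np.zeros(n) for n in sizes[1:]]
x = rng.standard_normal((5, 8))
y, zs = encoder_forward(x, weights, gammas, betas, noise_std=0.3, rng=rng)
```

Each row of `y` is a probability vector over the classes, and the list `zs` holds the normalized preactivations that the decoder would receive through the lateral connections.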

The supervised cost is the average negative log probability of the targets $t(n)$ given the inputs $x(n)$:

$$C_{\mathrm{class}} = -\frac{1}{N} \sum_{n=1}^{N} \log P(Y = t(n) \mid x(n)).$$
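
As a small worked example of this cost, the sketch below averages the negative log probability that the network assigns to the correct class. The function name and the integer encoding of the targets are assumptions made for illustration.

```python
import numpy as np

def classification_cost(y, targets):
    """Average negative log probability of the targets.

    y: array of shape (N, classes) with predicted class probabilities.
    targets: integer array of shape (N,) with the correct class indices.
    """
    n = y.shape[0]
    return -np.mean(np.log(y[np.arange(n), targets]))

# Two examples, two classes: the correct classes receive probability 0.5 and 0.75.
y_example = np.array([[0.5, 0.5], [0.25, 0.75]])
t_example = np.array([0, 1])
cost = classification_cost(y_example, t_example)
```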


2.2 Decoder for Unsupervised Auxiliary Task 

The unsupervised auxiliary task performs denoising similar to a traditional denoising autoencoder,
that is, it tries to match the reconstruction $\hat{x}$ with the original $x$.

Layer sizes in the decoder are symmetric to the encoder, and the corresponding decoder layer
$\hat{z}^{(l)}$ is calculated from the lateral connection $z^{(l)}$ and the vertical connection
$\hat{z}^{(l+1)}$. Lateral connections are restricted so that each unit $i$ in an encoder layer is
connected to only one unit $i$ in the corresponding decoder layer, but vertical connections are
fully connected and projected to the same space as $z^{(l)}$ by

$$u^{(l)} = N_B(V^{(l+1)} \hat{z}^{(l+1)}),$$

and the lateral neuron-wise connection for the $i$th neuron is

$$\hat{z}_i = a_{i1} z_i + a_{i2} \sigma(a_{i3} z_i + a_{i4}) + a_{i5},$$

$$a_{ij} = c_{ij} u_i + d_{ij},$$

where superscripts are dropped to avoid clutter, $\sigma(\cdot)$ is the sigmoid nonlinearity, and
$c_{ij}^{(l)}$ and $d_{ij}^{(l)}$ are the trainable parameters. This type of parametrization allows the
network to use information from the higher layer for any $a_{ij}$. The highest layer $L$ has
$u^{(L)} = 0$, and the lowest layer gives $\hat{x} = \hat{z}^{(0)}$ with $z^{(0)} = \tilde{x}$.
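
The neuron-wise denoising function can be sketched as below, assuming it has the form a1*z + a2*sigmoid(a3*z + a4) + a5 with each coefficient a_j = c_j*u + d_j computed per unit. The array layout (five coefficients per unit stored in the last axis) and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def denoise(z, u, c, d):
    """Combine the lateral signal z with the projected vertical signal u.

    z, u: arrays of shape (N, units).
    c, d: trainable parameters of shape (units, 5), one pair per coefficient.
    """
    # a[n, i, j] = c[i, j] * u[n, i] + d[i, j]
    a = c[None, :, :] * u[:, :, None] + d[None, :, :]
    a1, a2, a3, a4, a5 = np.moveaxis(a, -1, 0)  # five arrays of shape (N, units)
    return a1 * z + a2 * sigmoid(a3 * z + a4) + a5

z_lat = np.array([[0.5, -1.0]])
u_top = np.array([[2.0, 3.0]])

# With c = 0 and d = (1, 0, 0, 0, 0) the function passes z through unchanged.
c_par = np.zeros((2, 5))
d_par = np.zeros((2, 5))
d_par[:, 0] = 1.0
identity_out = denoise(z_lat, u_top, c_par, d_par)

# Additionally setting c_5 = 1 yields z + u, i.e. a purely additive connection.
c_par[:, 4] = 1.0
additive_out = denoise(z_lat, u_top, c_par, d_par)
```

The two settings illustrate how particular parameter values reduce the function to simpler lateral connections; other settings let u scale z multiplicatively through the first coefficient.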


Valpola (2015, Section 4.1) discusses how denoising functions represent corresponding
distributions. The proposed parametrization suits many different distributions, e.g. super- and
sub-Gaussian, and multimodal. Parameter $a_2$ defines the distance of peaks in multimodal
distributions (also the ratio of variances if the distribution is a mixture of two distributions
with the same mean but different variance). Moreover, this kind of decoder function is able to
emulate both the additive and modulated connections that were analyzed by Rasmus et al. (2015).


The cost function for the unsupervised path is the mean squared error, $n_x$ being the dimensionality
of the data:

$$C_{\mathrm{reconst}} = \frac{1}{n_x} \lVert \hat{x} - x \rVert^2.$$

The training criterion is a combination of the two such that the multiplier $\eta$ determines how much
the auxiliary cost is used, and the case $\eta = 0$ corresponds to pure supervised learning:

$$C = C_{\mathrm{class}} + \eta C_{\mathrm{reconst}}.$$
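
A minimal sketch of the combined criterion follows. It assumes the reconstruction cost is the squared error divided by the data dimensionality and, as an additional assumption not stated in the text, averaged over the minibatch. The name `eta` stands for the multiplier of the auxiliary cost.

```python
import numpy as np

def reconstruction_cost(x_hat, x):
    """Squared error per input dimension, averaged over the minibatch (assumed)."""
    n_x = x.shape[1]
    return np.mean(np.sum((x_hat - x) ** 2, axis=1) / n_x)

def total_cost(c_class, c_reconst, eta):
    """Supervised cost plus eta times the auxiliary cost; eta = 0 is purely supervised."""
    return c_class + eta * c_reconst

x_clean = np.zeros((2, 4))
x_recon = np.ones((2, 4))
c_rec = reconstruction_cost(x_recon, x_clean)   # every component is off by one
c_sup_only = total_cost(0.3, c_rec, eta=0.0)
c_with_aux = total_cost(0.3, c_rec, eta=500.0)
```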


The parameters of the model include $\gamma^{(l)}$, $\beta^{(l)}$ and $W^{(l)}$ for the encoder, and
$V^{(l)}$, $c_{ij}^{(l)}$ and $d_{ij}^{(l)}$ for the decoder. The encoder and decoder have roughly the
same number of parameters because the matrices $V^{(l)}$ are equal to $W^{(l)}$ in size. The only
difference comes from the per-neuron parameters, of which the encoder has only two ($\gamma_i$ and
$\beta_i$), but the decoder has ten ($c_{ij}$ and $d_{ij}$, $j = 1 \ldots 5$).


3 Experiments 


In order to evaluate the impact of the unsupervised auxiliary cost on the generalization performance,
we tested the model on the MNIST classification task. We randomly split the data into 50,000 examples
for training and 10,000 examples for validation. The validation set was used for evaluating the
model structure and hyperparameters and was finally included in training the model for test error
evaluation. To improve
statistical reliability, we considered the average of 10 runs with different random seeds. Both the 
supervised and unsupervised cost functions use the same training data. 


Model training took 100 epochs with a minibatch size of 100, equalling 50,000 weight updates.
We used the Adam optimization algorithm (Kingma and Ba, 2015) for weight updates, adjusting the
learning rate according to a schedule where the learning rate is linearly reduced to zero during the
last 50 epochs, starting from 0.002. We tested two models with layer sizes 784-1000-500-10 and
784-1000-500-250-250-250-10, of which the latter worked better and is reported in this paper. The best
input noise level was $\sigma = 0.3$, chosen from {0.1, 0.3, 0.5}. There are plenty of hyperparameters
and various model structures left to tune, but we were satisfied with the reported results.
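
The learning rate schedule and the number of weight updates can be checked with a short sketch. It assumes that the rate is adjusted once per epoch and that epochs are counted from zero; the text does not say whether the decay is applied per epoch or per update, so this granularity is an assumption.

```python
def learning_rate(epoch, total_epochs=100, decay_epochs=50, base_lr=0.002):
    """Constant rate first, then a linear decay that reaches zero at the end of training."""
    decay_start = total_epochs - decay_epochs
    if epoch < decay_start:
        return base_lr
    return base_lr * (total_epochs - epoch) / decay_epochs

# 50,000 training examples in minibatches of 100, repeated for 100 epochs.
updates_per_epoch = 50_000 // 100
total_updates = updates_per_epoch * 100
```

With these defaults the rate stays at 0.002 through epoch 49, is halved by epoch 75, and would be zero at epoch 100, i.e. right after the last epoch.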


3.1 Results 

Figure 2 illustrates how the auxiliary cost impacts the validation error by showing the error as a
function of the multiplier $\eta$. The auxiliary task is clearly beneficial, and in this case the best
tested value for $\eta$ is 500.

The best hyperparameters were chosen based on the validation error results, and the model was then
retrained 10 times with all 60,000 samples and measured against the test data. The worst test error
was 0.72 %, i.e. 72 misclassified examples, and the average was 0.684 %, which is significantly lower
than the previously reported 0.782 %. For comparison, we computed the average test error for the
$\eta = 0$ case, i.e. supervised learning with batch normalization, and got 0.89 %.

Figure 2: Average validation error as a function of the unsupervised auxiliary cost multiplier $\eta$
and average test error for the cases $\eta = 0$ and $\eta = 500$ over 10 runs. $\eta = 0$ corresponds to
pure supervised training. Error bars show the sample standard deviation. Training included 50,000
samples for validation, but for test error all 60,000 labeled samples were used.


Method                                                  Test error

SVM                                                     1.40 %
MP-DBM (Goodfellow et al., 2013)                        0.91 %
This work, $\eta = 0$                                   0.89 %
Manifold Tangent Classifier (Rifai et al., 2011)        0.81 %
DBM pre-train + Do (Srivastava et al., 2014)            0.79 %
Maxout + Do + adv (Goodfellow et al.)                   0.78 %
This work, $\eta = 500$                                 0.68 %

Table 1: A collection of previously reported MNIST test errors in the permutation-invariant setting.
Do: Dropout, adv: Adversarial training, DBM: deep Boltzmann machine.


4 Related Work 


The multi-prediction deep Boltzmann machine (MP-DBM) (Goodfellow et al., 2013) is a way to train a
DBM with back-propagation through variational inference. The targets of the inference include both
supervised targets (classification) and unsupervised targets (reconstruction of missing inputs) that
are used in training simultaneously. The connections through the inference network are somewhat
analogous to our lateral connections. Specifically, there are inference paths from observed inputs to
reconstructed inputs that do not go all the way up to the highest layers. Compared to our approach,
MP-DBM requires an iterative inference with some initialization for the hidden activations, whereas
in our case, the inference is a simple single-pass feedforward procedure.


5 Discussion 

We showed that a denoising autoencoder with lateral connections is compatible with supervised
learning using the unsupervised denoising task as an auxiliary training objective, and achieved good
results in the MNIST classification task with a significant margin to the previous state of the art.
We conjecture that the good results are due to supervised and unsupervised learning happening
concurrently, which means that unsupervised learning can focus on the features which supervised
learning finds relevant.

The proposed model is simple and easy to implement with many existing feedforward architectures, 
as the training is based on back-propagation from a simple cost function. It is quick to train and 
the convergence is fast, especially with batch normalization. The proposed architecture implements
complex functions such as modulated connections without a significant increase in the number of 
parameters. 

This work can be further improved and extended in many ways. We are currently studying the impact
of adding noise also to the hidden layers and including auxiliary layer-wise reconstruction costs
$\lVert \hat{z}^{(l)} - z^{(l)} \rVert^2$, and working on extending these preliminary experiments to
larger datasets, to semi-supervised learning problems, and to convolutional networks.